Detection of anomalous events

ABSTRACT

A system is described for receiving a stream of events and scoring the events based on anomalousness and maliciousness (or other classification). The system can include a plurality of anomaly detectors that together implement an algorithm to identify low-probability events and detect atypical traffic patterns. The anomaly detector provides for comparability of disparate sources of data (e.g., network flow data and firewall logs). Additionally, the anomaly detector allows for regulatability, meaning that the algorithm can be user configurable to adjust a number of false alerts. The anomaly detector can be used for a variety of probability density functions, including normal Gaussian distributions and irregular distributions, as well as functions associated with continuous or discrete variables.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

This invention was made with government support under Contract No. DE-AC05-00OR22725 awarded by the U.S. Department of Energy. The government has certain rights in the invention.

BACKGROUND

Anomaly detection is the search for items or events which do not conform to an expected pattern. The detected patterns are called anomalies and translate to critical and actionable information in several application domains.

There are different categories of anomaly detection including unsupervised and supervised anomaly detectors. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as “normal” and “abnormal” and involve training a classifier.

Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, event detection in sensor networks, etc. Nonetheless, a particular type of intrusion detection has remained problematic. More specifically, a zero-day attack or threat is an attack that exploits a previously unknown vulnerability in a computer application. Zero-day attacks are used or shared by attackers before the developer of the target software knows about the vulnerability. As such, they can be very difficult to defend against.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a system diagram for detecting anomalous events.

FIG. 2 is a flowchart of a method according to one embodiment for detecting anomalous events.

FIG. 3 is a flowchart of a method according to another embodiment for detecting anomalous events.

FIG. 4 is a graph illustrating a probability density function with a tunable parameter α used to adjust an area below the tunable parameter.

FIG. 5 is a graph illustrating a Gaussian distribution curve for probability density and the area associated with a tuning parameter.

FIG. 6 shows a graph of concentric ellipses associated with a multivariate Gaussian distribution.

FIG. 7 shows a probability function that is a weighted sum of multiple Gaussian distributions.

FIG. 8 shows an embodiment wherein discrete random variables are used.

FIG. 9 is an example table showing anomaly scores in a network environment.

FIG. 10 shows another example table with anomaly scores in a network environment.

FIG. 11 depicts a generalized example of a suitable computing environment in which the described innovations may be implemented.

DETAILED DESCRIPTION

A system is described for receiving a stream of events and scoring the events based on anomalousness and maliciousness (or other classification). The system can include a plurality of anomaly detectors that together implement an algorithm to identify low-probability events and detect atypical traffic patterns. The anomaly detector provides for comparability of disparate sources of data (e.g., network flow data and firewall logs). Additionally, the anomaly detector allows for regulatability, meaning that the algorithm can be user configurable to adjust a number of false alerts. The anomaly detector can be used for a variety of probability density functions, including normal Gaussian distributions and irregular distributions, as well as functions associated with continuous or discrete variables.

The system provides a probabilistic approach to anomaly detection. Specifically, given a data set with a probability distribution, a definition of an anomaly score depends on the probability distribution. The system can be applicable to any distribution and is comparable across disparate distributions. A threshold can be set which defines all data points with an anomaly score greater than the threshold as anomalous. Comparability across distributions means that the anomaly score picks out the same percentage of most anomalous data points as the distribution changes while the threshold is held constant. Thus, using embodiments described herein, a network flow anomaly can be compared to a firewall log anomaly, which is challenging if they adopt different or ad hoc anomaly detection approaches. Some embodiments can also provide regulatability in the sense that analysts can, in advance, set the proportion of false alerts. A definition of false alerts, in the context of unsupervised learning, is in order. The anomaly detection method can be applied to two types of data. The first is data produced similarly to training data. The second is data produced by a different, unknown method. Data produced similarly to the training data but that is flagged as anomalous is a false alert. The rate of false alerts does not depend on the choice of the second source of data, as the model was trained without access to that source. In cyber security, the size of the data sets is so large that even a relatively small number of false alerts can have an impact. For example, a false alert rate of 0.001 may seem small, but in a network with one million events per hour, there would be 1000 false alerts per hour. Of course, reducing the false alert rate inevitably reduces the true alert rate as well.

A particular system 100 is shown in FIG. 1. Multiple input sources are shown at 110 and any number of input sources can be used (N input sources are shown, where N is any integer value). The sources can be any of a variety of hardware or software resources. A particular environment in which the embodiments described herein can be used is a network environment. In a network environment, the sources 110 can be network sources of different types. For example, the network sources can be workstations, firewalls, DNS, web servers, etc. Thus, the sources can have different platforms and generate disparate log files (e.g., the log files can relate to events associated with different variable types). The source 110 can also be a source that provides streaming data. The log file or input stream typically includes events, which are actions or occurrences that are detected by a program or hardware component. The events can be associated with a time-stamp in conjunction with captured data. The format of the input stream can vary depending on the source 110, but generally the input streams include a network address (e.g., an IP address) and a time stamp. Other examples of the input stream include firewall data that includes source and destination IP addresses, source and destination port addresses, a protocol, and a message value defined by the firewall. Still another example of the input stream can be network data that includes source and destination IP addresses, source and destination ports, a protocol, and bytes or packets associated with message data. Each log file or input stream can be received in its respective anomaly detector 120. A plurality of anomaly detectors 120 can sit in parallel, one for each log file. In some embodiments, the anomaly detectors can include memory registers and processing capacity to implement a parser, which analyzes the syntax of the input streaming data and transforms it into a standard format (e.g., JSON or other text-based languages).
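For example, a minimal parser sketch in Python is shown below. The log format, field names, and JSON layout are hypothetical assumptions, not part of the disclosure; they only illustrate normalizing a raw firewall record into a standard text-based form before scoring.

```python
import json

def parse_firewall_line(line):
    """Parse a hypothetical comma-separated firewall record into JSON.

    Assumed field order: timestamp, source IP, source port, destination IP,
    destination port, protocol, message code.
    """
    ts, src_ip, src_port, dst_ip, dst_port, proto, msg = line.strip().split(",")
    event = {
        "timestamp": ts,
        "src_ip": src_ip,
        "src_port": int(src_port),
        "dst_ip": dst_ip,
        "dst_port": int(dst_port),
        "protocol": proto,
        "message_code": msg,
    }
    return json.dumps(event)

print(parse_firewall_line("2024-01-01T00:00:00,10.0.0.5,51515,10.0.0.1,443,TCP,ACCEPT"))
```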

Each anomaly detector 120 generally produces an anomaly score that can be used as an input to a comparator 130. The score is dependent on an anomalousness of the event. Thus, the anomaly detector can transform a log file or log streaming data into a score based on the anomaly calculation. Even though the log files can be disparate, the anomaly detectors 120 can generate a comparable score through use of the following equation:

A _(ƒ)(x):=−log₂ P _(ƒ)(ƒ(X)≦ƒ(x))

More generically, the formula can be as follows:

A_(f)(x)=−log_(b) P_(f)(f(X)≦f(x)), where b>1.

As discussed further below, the anomaly score A_(ƒ)(x) for a random variable X is predictable and P_(ƒ)(A_(ƒ)(x)) can be bounded independent of f under the assumption that X is generated according to the distribution described by f. Each anomaly detector 120 can have an independently adjustable tuning parameter that assists in making the output anomaly scores comparable.
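The score in the preceding equation can be evaluated for essentially any model. The following Python sketch, which is an illustration rather than the disclosed implementation, estimates A_ƒ(x) by Monte Carlo: it draws samples from the modeled distribution and measures how often the density of a sample is at most the density of the observed event. The particular density and sampler shown are assumptions for the example.

```python
import numpy as np

def anomaly_score(x, pdf, sampler, n_samples=100_000, rng=None):
    """Estimate A_f(x) = -log2 P_f(f(X) <= f(x)) by Monte Carlo.

    pdf     -- density (or mass) function f of the modeled distribution
    sampler -- function drawing n i.i.d. samples from that distribution
    """
    rng = rng or np.random.default_rng()
    samples = sampler(n_samples, rng)
    # Fraction of the modeled distribution whose density is <= f(x).
    p = np.mean(pdf(samples) <= pdf(x))
    p = max(p, 1.0 / n_samples)          # avoid log2(0) for unseen tails
    return -np.log2(p)

# Example with a hypothetical standard Gaussian model.
pdf = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
sampler = lambda n, rng: rng.standard_normal(n)
print(anomaly_score(0.0, pdf, sampler))   # ~0 bits: the mode is not anomalous
print(anomaly_score(4.0, pdf, sampler))   # many bits: a 4-sigma event is anomalous
```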

The comparator 130 can generate an output score 140. A variety of comparators can be used, but in some embodiments the comparator can generate the output 140 using a linear combination of anomaly scores. The linear combination can use weighting parameters associated with a user weight input file 150 or an input policy document 160.
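One way such a comparator could be realized is as a weighted sum of per-detector scores. The sketch below is illustrative only; the detector names and weights are hypothetical stand-ins for values loaded from the user weight input file 150 or policy document 160.

```python
def combine_scores(scores, weights):
    """Linear combination of per-detector anomaly scores.

    scores  -- dict mapping detector name to its anomaly score (in bits)
    weights -- dict mapping detector name to its user-assigned weight
    """
    return sum(weights.get(name, 1.0) * s for name, s in scores.items())

# Hypothetical weights, e.g. loaded from a policy document.
weights = {"netflow": 0.7, "firewall": 1.3}
scores = {"netflow": 12.4, "firewall": 21.7}
print(combine_scores(scores, weights))   # approximately 36.89
```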

The above definition of anomalousness is based on the probability of the probability. The advantage of this approach is that it allows for regulatability and comparability. This can be demonstrated by considering the threshold selection problem. If a threshold of, say, 10 is set in advance, then an event with a probability 2⁻¹⁰ or lower would be considered anomalous. Now consider a uniform discrete distribution. If it has 100 possible values, then none of the events are considered anomalous. However, if it has 2000 possible values, then they are each considered anomalous even though they are each maximally likely. A regulatable definition of anomalousness would be self-adapting to these various distributions. Intuitively, it is not just the rarity of the event itself, but how surprising that rarity is in the context of the distribution.

As indicated above, the different log files are generated from a multiplicity of sources 110, each with their own properties. Suppose that we observe two types of discrete variables, one with two possible values and another with a thousand possible values. An ideal definition of anomalousness would apply to both distributions without requiring the tuning of a parameter for each dataset, and yet would allow for the direct comparison of the anomalousness of values of one variable with values of the other variable. A direct comparison can be accomplished by considering not the rarity of the event itself, but how rare the rarity of an event is.

As discussed above, the anomaly detectors 120 can generate an anomaly score for x using the following definition: A_(ƒ)(x):=−log₂ P_(ƒ)(ƒ(X)≦ƒ(x)), wherein a (discrete or continuous) random variable X is given with probability density or mass function ƒ defined on domain D, with A_(ƒ):D→R_(≧0).

It is worth noting that A_(ƒ) is defined on D, the same domain as f. The negative log is used for numerical reasons and so that larger anomalousness corresponds to larger numbers. Also, since the probabilities of interest are likely to be very close to zero, the use of log helps emphasize their differences. Nonetheless, the log need not be included and is an implementation detail. The choice of base 2 for the log is so that the anomalousness is, in an information theoretic sense, measured in bits.

FIG. 2 is a flowchart of a method according to one embodiment for detecting anomalous events. In process block 210, a first log file from a first data source can be received. As previously indicated, the source can be a network-based source (hardware or software) or a non-network based source. Virtually any source that produces a log file can be used, wherein the log file can be a discrete file or continuous streaming data. In process block 220, a second log file can be received from a second data source, wherein the second data source can be a different type than the first data source. Examples of different data sources are those that use a different hardware platform or different software platform. Alternatively, the different data sources can generate log files that include disparate event types. In one example, the disparate event types can include different variable types (e.g., multinomial variables vs. Gaussian mixture variables). In process block 230, a first anomaly score can be generated. Thus, for example, an anomaly detector can receive the first log file and transform it into an anomaly score using a form of the definition A_(ƒ)(x):=P_(ƒ)(ƒ(X)≦ƒ(x)). This equation can also include a multiplier, such as a constant or a log function to allow small discrepancies to be more easily spotted. In process block 240, a second anomaly score can be generated using a similar definition. Thus, the second log file can be transformed into a second anomaly score. Even though the log files can contain disparate events, the anomaly detectors can generate scores that are directly comparable. A tuning parameter, described further below, can assist in making the scores comparable. In process block 250, an automatic comparison can be made, such as by detecting which anomaly score is larger. Various comparison techniques are well understood in the art and need not be repeated. The method can easily be scaled to include additional log files by simply adding anomaly detectors in parallel.

FIG. 3 shows another embodiment of a method that can be used for detecting anomalous events. In this embodiment, anomalous events related to a network are specifically addressed. In process block 310, a plurality of network events can be received from disparate network sources. The network sources can be from workstations, firewalls, DNS, Web servers, or other network devices. In process block 320, multiple anomaly scores can be calculated for each of the network events using a form of the equation A_(ƒ)(x):=P_(ƒ)(ƒ(X)≦ƒ(x)), such as A_(ƒ)(x):=−log₂ P_(ƒ)(ƒ(X)≦ƒ(x)). In process block 330, the multiple anomaly scores can be compared in any desired manner.

The above definition of an anomaly score can be interpreted with respect to the graph of the probability density, as shown in FIG. 4. Given an event x, P_(ƒ)(ƒ(X)≦ƒ(x))=∫_({t|ƒ(t)≦ƒ(x)}) ƒ(t)dt, hence P_(ƒ)(ƒ(X)≦ƒ(x)) equals the area of the shaded region, under the line defined by α. The negative log base two of that area is the bits of meta-rarity (i.e., the anomaly score).

An example of the definition can be illustrative of its benefits. Consider a discrete uniform distribution in which each event has the same probability. For any x, the probability of the random variable X having probability mass ƒ(X) less than or equal to ƒ(x) is one. Therefore, A_(ƒ)(x)=−log₂ 1=0 for all x in the distribution. This is a particular case of the general observation that any mode of a distribution has anomaly score 0. Without selecting a threshold or tuning a parameter α, a conclusion can be made that a discrete uniform distribution has no anomalies.
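A short numeric check of this observation is given below; the uniform mass function over 128 outcomes is purely illustrative.

```python
import numpy as np

def discrete_anomaly_score(x, pmf):
    """A_f(x) = -log2 of the total mass of outcomes no more likely than x."""
    p = sum(q for q in pmf.values() if q <= pmf[x])
    return -np.log2(p)

uniform = {k: 1 / 128 for k in range(128)}       # 128 equally likely outcomes
print(discrete_anomaly_score(0, uniform))        # 0 bits: no outcome is anomalous
```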

An advantage of this definition of anomaly is that A_(ƒ)(X) for a random variable X is predictable, and P_(ƒ)(A_(ƒ)(X)) can be bounded independent of ƒ under the assumption that X is generated according to the distribution described by ƒ.

The following theorem can be used to prove this: Let X be distributed according to probability distribution ƒ. Then the probability that the anomalousness exceeds α is no greater than 2^(−α). That is,

P _(ƒ)(A _(ƒ)(x)>α)≦2^(−α)

The more generic formula is P_(f)(A_(f)(x)≧α)≦b^(−α), where b>1.

By this theorem, the proportion of events flagged as false alerts at the α level is no more than 2^(−α) for samples generated according to ƒ. In particular, the number of false alerts at a given threshold is independent of ƒ. Hence, false alerts can be regulated by selecting an appropriate α. Furthermore, if X is produced according to ƒ and Y is produced according to g, then A_(ƒ)(X) and A_(g)(Y) are comparable since they are both negative log probabilities. This definition of anomalousness therefore provides comparability across different sources even if each source is modeled using a different probability distribution.

Note that the bits of meta-rarity definition of anomalousness has no parameters that need to be set arbitrarily or by experimentation. The definition is, in this sense, self-tuning: it uses the distribution itself as sufficient context for determining anomalousness. One reasonable way to use these advantages of the definition is to set a threshold based on the size (or throughput) of the data to be analyzed. If a cyber security data set has, say, one million events that will be scored for anomalousness, then setting a threshold at log₂ 1,000,000=19.93≈20=α should yield at most one false alert, by the theorem, assuming that events are really generated according to ƒ. Deviations in the number of anomalies will indicate that the model (i.e., choice of ƒ) does not match the generating distribution. This mismatch could be because f was not properly selected or tuned, or it could be because there is another source of events. In either case, exploration of the anomalies will provide insight into both the state of the system and changes within it.
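The arithmetic behind that rule of thumb is simple enough to automate. The helper below is an illustrative assumption rather than part of the disclosure; it picks α from an expected event volume and a tolerated false-alert budget using the bound from the theorem.

```python
import math

def threshold_for(events_per_window, false_alerts_allowed=1.0):
    """Pick alpha so that, by the bound P(A_f(X) >= alpha) <= 2**-alpha,
    the expected number of false alerts per window is at most the budget."""
    return math.log2(events_per_window / false_alerts_allowed)

print(threshold_for(1_000_000))        # ~19.93 bits, round up to alpha = 20
print(threshold_for(1_000_000, 10))    # ~16.6 bits if 10 false alerts are tolerable
```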

EXAMPLES

A Gaussian example is considered first. Consider a Gaussian distribution, where the above-described definition of anomalousness is (monotonically) equivalent to a z-score. The z-score essentially captures the normalized distance to the mean and offers regulatability as the distributions are known. However, the z-score is specific to Gaussian distributions, making comparability across different distributions difficult. FIG. 5 shows how the anomaly definition picks out the tails of a Gaussian distribution, in agreement with a z-score-based definition of anomalousness.

For a Gaussian distribution with mean μ and standard deviation σ, the probability density function ƒ: R→R_(≧0) is defined as

${f(x)} = {\frac{1}{\sqrt{2{\pi\sigma}}}{\exp\left( \frac{\left( {x - \mu} \right)^{2}}{2\sigma^{2}} \right)}}$

The anomalousness is then given by

A _(ƒ)(x)=−log₂ P _(ƒ)(ƒ(X)≦ƒ(x)).

It can be seen that ƒ(y)≦ƒ(x) if and only if (y−μ)²≧(x−μ)². The probability of the set of such y is then the sum of the tail probabilities, which can be given in terms of the cumulative distribution function F. The left tail has probability F(μ−|x−μ|) and the right tail has probability 1−F(μ+|x−μ|). However, by the symmetry of the Gaussian distribution, these two probabilities are equal. Hence the anomalousness can be written as

A_(ƒ)(x)=−log₂(2F(μ−|x−μ|))

Typically, the observations are first standardized by defining

$z = {\frac{x - \mu}{\sigma}.}$

Let G denote the cumulative distribution function of the standardized one-dimensional Gaussian. Then the anomalousness becomes

A_(ƒ)(z)=−log₂(2G(−|z|))

Evidently, the more x deviates from μ, the more anomalous it is. Therefore, the anomaly score is monotonically equivalent to the z-score. However, the anomalousness A_(ƒ) does not require the parametric assumption of the z-score.
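A closed-form evaluation of this Gaussian case is straightforward with a standard statistics library. The sketch below assumes SciPy and is an illustration, not the disclosed implementation.

```python
import numpy as np
from scipy.stats import norm

def gaussian_anomaly_score(x, mu=0.0, sigma=1.0):
    """A_f(x) = -log2( 2 * G(-|z|) ) for a univariate Gaussian model."""
    z = (x - mu) / sigma
    return np.log2(1.0 / (2.0 * norm.cdf(-abs(z))))

for z in (0.0, 1.0, 2.0, 3.0):
    print(z, round(gaussian_anomaly_score(z), 2))
# The score is 0 bits at the mean and grows monotonically with |z|
# (roughly 1.66, 4.46, and 8.53 bits at 1, 2, and 3 standard deviations).
```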

For a given false alert rate and a given Gaussian distribution, an appropriate threshold can be deduced. This shows that the bits of rarity definition provides regulatability in this case. However, the threshold depends explicitly on the distribution parameters. Bits of rarity (at a given threshold) gives different false alert rates for different parameters. Hence the bits of rarity definition does not provide comparability across distributions.

A multivariate Gaussian example is considered next. For a k-dimensional multivariate Gaussian, the probability density function ƒ: R^(k)→R is defined to be

${f(x)} = {\frac{1}{\sqrt{2{\pi\sigma}}}{\exp\left( \frac{\left( {x - \mu} \right)^{2}}{2\sigma^{2}} \right)}}$

where μ is the mean and Σ is the positive definite covariance matrix. (Here, x and μ are thought of as column vectors, and υ^(t) for υ a column vector is the transpose of υ, which is a row vector.)

Note that ƒ is monotonic in −(x−μ)^(t)Σ⁻¹(x−μ). FIG. 6 shows the level sets of the distribution. The anomalousness of an event x is then the probability of an event being outside that level curve. (Note that this observation uses the unimodality of ƒ.) Since Σ is positive definite, its square root S=√Σ can be computed such that Σ=S^(t)S. Then,

A _(ƒ)(x)=−log₂ P _(ƒ)(∥S ⁻¹(X−μ)∥≧∥S ⁻¹(x−μ)∥).

Thus, the definition of anomalousness agrees with common practice, as it identifies the tails. In fact, this shows that the anomalousness of a multivariate Gaussian event is monotonically equivalent to its Mahalanobis distance, a common reparameterization used in machine learning. Mahalanobis distance, like the z-score, can be used to provide regulatability, but fails to provide comparability across distributions.
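Because ∥S⁻¹(X−μ)∥² is chi-square distributed with k degrees of freedom when X is Gaussian, the probability outside the level curve through x can be read off a chi-square tail. The sketch below, which assumes NumPy and SciPy and uses hypothetical parameters, is one way to evaluate the score in this case.

```python
import numpy as np
from scipy.stats import chi2

def mvn_anomaly_score(x, mu, cov):
    """A_f(x) = -log2 P(||S^-1(X-mu)|| >= ||S^-1(x-mu)||) for a k-dim Gaussian.

    The squared Mahalanobis distance of a Gaussian sample is chi-square with
    k degrees of freedom, so the tail probability is chi2.sf(d2, k).
    """
    x, mu, cov = np.asarray(x, float), np.asarray(mu, float), np.asarray(cov, float)
    d2 = (x - mu) @ np.linalg.solve(cov, x - mu)   # squared Mahalanobis distance
    return float(np.log2(1.0 / chi2.sf(d2, df=len(mu))))

# Hypothetical 2-D model (e.g., correlated byte and packet counts).
mu = np.array([10.0, 5.0])
cov = np.array([[4.0, 1.5], [1.5, 2.0]])
print(mvn_anomaly_score([10.0, 5.0], mu, cov))   # 0 bits at the mean
print(mvn_anomaly_score([20.0, 1.0], mu, cov))   # many bits: far outside the ellipse
```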

The next example is a Gaussian mixture. The two previous examples show that the distance to the mean (appropriately normalized) provides a reasonable definition of anomalousness for some distributions. However, it is problematic for multimodal distributions. The mixture of Gaussian distributions has a probability density function that is the weighted sum of multiple Gaussian distributions, as shown in FIG. 7. Potentially, the mean of a Gaussian mixture distribution could be an anomaly, since it can fall in a valley between the modes. This example illustrates that a general definition of anomalousness cannot be based on identifying just the tails. The bits of rarity and the bits of meta-rarity definitions both capture the rare events in the middle of the distribution as anomalous.
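To see this numerically, the Monte Carlo approach sketched earlier can be applied to a two-component mixture. The component means, weights, and sample count below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_pdf(t):
    """Equal-weight mixture of N(-3, 1) and N(+3, 1)."""
    g = lambda t, m: np.exp(-(t - m) ** 2 / 2) / np.sqrt(2 * np.pi)
    return 0.5 * g(t, -3.0) + 0.5 * g(t, 3.0)

def mixture_sample(n):
    comp = rng.integers(0, 2, size=n)                  # pick a component per sample
    return rng.normal(np.where(comp == 0, -3.0, 3.0), 1.0)

def score(x, n=200_000):
    samples = mixture_sample(n)
    p = max(np.mean(mixture_pdf(samples) <= mixture_pdf(x)), 1.0 / n)
    return -np.log2(p)

print(score(-3.0))   # near a mode: score close to 0 bits
print(score(0.0))    # the overall mean, in the valley: noticeably higher score
```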

The next example is for Multinomials. Bits of meta-rarity apply equally well to discrete distributions. For these, the anomalousness of x is the negative log base two of the sum of all probabilities less than or equal to ƒ(x). This is demonstrated in FIG. 8. Because of the comparability, a comparison can be made between the anomalousness of a multinomial variable and the anomalousness of, say, a Gaussian mixture variable. Thus, different variable types can be used in generating an anomaly score and the anomaly scores are comparable. Another advantage of this approach is that it extends to any random variable, inclusive of complex probabilistic constructions, such as random graphs and stochastic processes.

The final example is specific to cyber security data sets. The example data set can include entries, wherein each entry is comprised of a timestamp, source IP address, source port, destination IP address, destination port, protocol, and message code. A variable can be derived from a log. Then a probability distribution can be estimated. Finally, events can be scored for anomalousness.

The specific cyber security example is for IP-to-IP by role. For a given observation, the following pair can be extracted: (Source IP role, destination IP role). For each IP address, a role was assigned. The possible roles were Workstation, Firewall, DNS, Web Server, and Unassigned, which are abbreviated as W, F, D, S, and U. Other network-based hardware or software resources can be used. Any such pair can be taken as the observed value of a random multinomial variable X. Let N_(a,b) be the number of firewall log entries with source IP role a and destination IP role b. The probabilities ƒ(a, b) are estimated by N_(a,b)/Σ_(x,y)N_(x,y). (Incorporating priors would assist with the scoring of previously unseen events.) The anomalousness of a pair (a, b) is given by

${A_{f}\left( \left( {a,b} \right) \right)} = {{- \log_{2}}\frac{\Sigma}{\left\{ {\left( {x,y} \right){{f\left( {x,y} \right)} \leq {f\left( {a,b} \right)}}} \right\}}{{f\left( {x,y} \right)}.}}$

The observed probabilities and anomaly scores for the random variable are summarized in FIGS. 9 and 10.
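A sketch of this computation is given below. The counts are hypothetical placeholders, not the data underlying FIGS. 9 and 10; the point is only how the N_(a,b) counts become probabilities and then bits of meta-rarity.

```python
import numpy as np

# Hypothetical (source role, destination role) counts from a firewall log.
counts = {
    ("W", "S"): 52000, ("W", "D"): 31000, ("S", "W"): 27000,
    ("D", "U"): 4100,  ("D", "D"): 160,   ("F", "W"): 12,
}

total = sum(counts.values())
probs = {pair: n / total for pair, n in counts.items()}

def role_pair_score(pair):
    """A_f((a,b)) = -log2 of the total probability of pairs no more likely."""
    p = sum(q for q in probs.values() if q <= probs[pair])
    return -np.log2(p)

for pair in counts:
    print(pair, round(role_pair_score(pair), 3))
# The rarest pairs (F->W in this made-up table) receive the largest scores.
```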

The most anomalous communication originates from the firewall and terminates at a workstation, with an anomaly score of 21.669. Indeed, traffic classified in this group represents communication specifically between the firewall and the log server. On one hand, the relative lack of communication conforms with expectations of standard network behavior; however, further analysis indicates that communication between the firewall and the log server surprisingly terminates after 15 minutes of the 40 hour dataset.

The second most anomalous communication originates from the DNS server and terminates at a workstation, with an anomaly score of 16.311. Further analysis of this traffic indicates attacks on the DNS server involving ports 135, 137, 139, and 445, all of which are associated with file-sharing traffic.

The third most anomalous communication originates from the DNS server and terminates at a DNS server, with an anomaly score of 14.472. Further analysis of this traffic indicates normal DNS traffic. This is expected network behavior. It is noteworthy that the traffic from DNS to DNS is more anomalous than traffic from DNS to Unlabeled. This trend indicates a possible loss of control of the DNS server.

In this example, the anomaly scoring served to identify atypical events in a streaming environment. The insights gained from this process provided a useful step in developing a comprehensive situational understanding of the network.

A principled probability-based definition of anomalousness has been defined that is reasonable, general (in that it applies to anything modeled by a probability distribution), comparable (in that scores of disparate types of events can be compared), and regulatable (in that the rate of false alerts can be set in advance).

The following proof can be used to further support the description.

Adopting measure theory notation

A _(ƒ)(x)=−log₂(μ{t:ƒ(t)≦ƒ(x)}).

Note that

{x:μ{t:ƒ(t)≦ƒ(x)}≦μ{t:ƒ(t)≦ƒ(y)}}={x:ƒ(x)≦ƒ(y)}.

Proposition A.1: Fix yεD. Then

P(A_(ƒ)(X)≧A_(ƒ)(y))=2^(−A_(ƒ)(y))

Proof. P(A_(ƒ)(X)≧A_(ƒ)(y)) may be rewritten as

$\begin{matrix}{= {\mu\left\{ {x:{A_{f}(x)} \geq {A_{f}(y)}} \right\}}} \\{= {\mu\left\{ {x:{- \log_{2}\left( {\mu\left\{ {t:{f(t)} \leq {f(x)}} \right\}} \right)} \geq {- \log_{2}\left( {\mu\left\{ {t:{f(t)} \leq {f(y)}} \right\}} \right)}} \right\}}} \\{= {\mu\left\{ {x:{\mu\left\{ {t:{f(t)} \leq {f(x)}} \right\}} \leq {\mu\left\{ {t:{f(t)} \leq {f(y)}} \right\}}} \right\}}} \\{= {\mu\left\{ {x:{f(x)} \leq {f(y)}} \right\}}} \\{= 2^{- {A_{f}(y)}},}\end{matrix}$

which proves the proposition.

Recall that P(A_(ƒ)(X)≧α)=μ{x:A_(ƒ)(x)≧α}. The proof can be broken into two cases.

Case 1: Suppose that for all yεD, A_(ƒ)(y)<α. The result is trivially true as {x:A_(ƒ)(x)≧α}=φ.

Case 2: Now suppose that {x:A_(ƒ)(x)≧α}≠φ, so that there exists some y such that A_(ƒ)(y)≧α. Then set r=inf{A_(ƒ)(x):A_(ƒ)(x)≧α}, and let x_(n)εD so that A_(ƒ)(x_(n))↓r. Hence,

${\left\{ {x:{{A_{f}(x)} \geq \alpha}} \right\} = {\bigcap\limits_{n = 1}^{\infty}\left\{ {x:{{A_{f}(x)} \geq {A_{f}\left( x_{n} \right)}}} \right\}}},$

the sets on the right being nested. By the finiteness of the measure,

$\begin{matrix}{{\mu\left\{ {x:{A_{f}(x)} \geq \alpha} \right\}} = {\lim\limits_{n}{\mu\left\{ {x:{A_{f}(x)} \geq {A_{f}\left( x_{n} \right)}} \right\}}}} \\{= {\lim\limits_{n}2^{- {A_{f}\left( x_{n} \right)}}}} \\{= {2^{- r}} \leq {2^{- \alpha}}},\end{matrix}$

since r≧α.

The following is a further proof that any anomaly score on a finite set can be realized by a probability distribution. Further, it is shown that −log_(b) is not flexible enough to let every A-score be realized via our definition.

As described above, given a probability distribution ƒ, the following three characteristics should be realized by any anomaly score:

(1) A_(ƒ) respects the distribution; that is, ƒ(x)>ƒ(y) implies A_(ƒ)(x)<A_(ƒ)(y).

(2) A_(ƒ) is defined on any distribution.

(3) A_(ƒ) is comparable across distributions. Specifically, fixing a threshold α so that we classify {x:A(x)≧α} as “anomalous” and {x:A(x)<α} as “non-anomalous” will classify a fixed percentage, say the least probable d %, of the distribution as “anomalous” events (and the most probable (100−d) % as “non-anomalous”). Hence, regardless of the distribution, for a fixed threshold, the anomalous events are the most rare for their given distribution.

Hence, the following definition was made: A_(f)(x)=−log₂ p(f(X)≦f(x)).

While this definition preserves the desired qualities of an anomaly score, it is too restrictive. It can be argued that the anomaly scoring arises without first putting a probability distribution on the data. Moreover, it can be argued that using heuristic methods for anomaly detection (for example, clustering, density estimation) to find an anomaly score is implicitly imposing a probability distribution on the data, and that making an explicit assumption of the probability distribution is better. The theorem below proves this “implicit probability distribution” exists if a member of the class of functions {log_(b)}_(b) is allowed to replace log₂.

Theorem: Let A: {x₁, . . . , x_(n)}→[0, ∞], with min_(x_(j)) A(x_(j))=0. Then there exists b>1, and a probability distribution p on {x₁, . . . , x_(n)}, so that

${A(x)} = {- {{\log_{b}\left( {\sum\limits_{\{{x_{j}:{{p(x_{j})} \leq {p{(x)}}}}\}}\; {p\left( x_{j} \right)}} \right)}.}}$

Note that since {x₁, . . . , x_(n)} is finite, assuming min_(x_(j)) A(x_(j))=0 is without loss of generality via translation.

Proof. Since

$\left( {\sum\limits_{\{{x_{j}:{{p(x_{j})} \leq {p{(x)}}}}\}}\; {p\left( x_{j} \right)}} \right)$

is increasing in p(x), then p(x_(j))>p(x_(k)) implies A(x_(j))<A(x_(k)).

Without loss of generality, write A(x_(j)) in decreasing order, and let l be the number of values A takes. A₁, . . . , A_(l) are set to these l values, so that

A₁ := A(x₁) = … = A(x_(k₁)) A₂ := A(x_(k₁ + 1)) = … = A(x_(k₁ + k₂))⋮ A_(l) := A(x_(n − k_(l) + 1)) = … = A(x_(n))

where k_(j) is the multiplicity of A_(j). Notice A₁>A₂> . . . >A_(l)=0.

Set

p₁ := p(x₁) = … = p(x_(k₁)) p₂ := p(x_(k₁ + 1)) = … = p(x_(k₁ + k₂))⋮ p_(l) := p(x_(n − k_(l) + .)) = … = p(x_(n))

so the task is to find p₁, . . . , p_(l) so that the definitions of A(x) and p(x_(j)) hold.

However, these hold if and only if

p₁ < p₂ < … < p_(l) and
k₁p₁ + … + k_(l−1)p_(l−1) + k_(l)p_(l) = b^(−A_(l))
k₁p₁ + … + k_(l−1)p_(l−1) = b^(−A_(l−1))
⋮
k₁p₁ = b^(−A₁).

Solving the system of equations inductively shows that

$p_{1} = \frac{b^{- A_{1}}}{k_{1}}\quad\text{and for }1 < j \leq l,\quad p_{j} = \frac{b^{- A_{j}} - b^{- A_{j - 1}}}{k_{j}}$

Since A₁>A₂> . . . >A_(l)=0, b is chosen large enough so that p₁<p₂< . . . <p_(l) holds.

From the proof above, b can be chosen depending on the given score A; specifically, if one attempts to fix b a priori, then there are l equations to be solved, plus one inequality, but only l unknowns; hence while p₁, . . . , p_(l) can be chosen to satisfy the equalities, there is no guarantee the required inequality will hold. To illustrate this, we give an example where b=2 fails.

Let A:{x₁, . . . , x₅}→[0, ∞) by A(x₁)=2, A(x₂)=A(x₃)=A(x₄)=1, and A(x₅)=0.

Now the necessary equations in the proof above require

p₁ = b⁻², $p_{2} = \frac{b^{- 1} - b^{- 2}}{3}$, p₃ = 1 − b⁻¹.

So fixing, say, b=2 a priori gives

p₁=¼, p₂=1/12, p₃=½,

so that p₁<p₂<p₃ fails. Thus for this example, a larger b is needed. An example of a b that clearly works in this case is b=10. To see that this satisfies the necessary inequality, the following calculation can be done:

p₁=1/100, p₂=3/100, p₃=9/10.
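A quick numeric check of this example (a small verification script, not part of the disclosure):

```python
def probs(b):
    """p1, p2, p3 from the system above for A = (2, 1, 1, 1, 0)."""
    p1 = b ** -2
    p2 = (b ** -1 - b ** -2) / 3
    p3 = 1 - b ** -1
    return p1, p2, p3

for b in (2, 10):
    p1, p2, p3 = probs(b)
    print(b, (p1, p2, p3), "ordering holds:", p1 < p2 < p3)
# b=2  -> (0.25, 0.0833..., 0.5): the required ordering p1 < p2 < p3 fails
# b=10 -> (0.01, 0.03, 0.9):      the ordering holds
```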

As a next step, the theorem can be applied to a larger set D, perhaps D⊂R^(n), to accommodate the continuous probability distribution case. From the previous example, it can be noted that having A repeat the value 1 three times (while A only took value 2 at one point) necessitated a relatively large value of b. In fact, if the domain D is changed to a larger set, thereby allowing A(D) to be infinite, an increasingly large disparity can be chosen in the multiplicity of each value in A(D) so that no value of b suffices. Consider the following example, where D=[−1, e] and A(D)=N. It can be shown that no value of b can satisfy the necessary inequality.

Let A:[−1, e]→[0, ∞] as follows,

${A(x)} = \left\{ \begin{matrix}0 & {x = 0} \\1 & {x \in \left( {0,1} \right\rbrack} \\2 & {x \in \left( {1,{1 + \frac{1}{2}}} \right\rbrack} \\3 & {x \in \left( {{1 + \frac{1}{2}},\left. {1 +} \middle| {\frac{1}{2} + \frac{1}{3!}} \right.} \right\rbrack} \\\vdots & \vdots \\n & \left( {{\sum\limits_{j = 1}^{n - 1}\frac{1}{j!}},{\sum\limits_{j = 1}^{n}\frac{1}{j!}}} \right\rbrack \\\vdots & \vdots\end{matrix} \right.$

A probability distribution p(x) can be constructed so that

A(x)=−log_(b)∫_({t:p(t)≦p(x)}) p(t)dt.

As before, since (∫_({t:p(t)≦p(x)})p(t)dt) is increasing in p(x), it can be seen that p(x_(j))>p(x_(k)) implies A(x_(j))<A(x_(k)).

Hence, it is desirable to find countably many values p_(j), j=0, 1, 2, . . . , that give the respective value of p(x) on {x:A(x)=j}. Thus,

p₀>p₁>p₂> . . .

Let k_(j)=measure of {x: A(x)=j}. In this case, it can be seen that k₀=1, and for j>0,

$k_{j} = {\frac{1}{j!}.}$

As in the proof of the theorem, this gives a (countable) system of equations, namely

k₀p₀ + k₁p₁ + k₂p₂ + … = b⁰ k₁p₁ + k₂p₂ + … = b⁻¹ ⋮${\sum\limits_{j \geq n}{k_{j}p_{j}}} = b^{- n}$ ⋮

Inductively solving gives

p₀ = 1 − b⁻¹
$p_{1} = \frac{b^{- 1} - b^{- 2}}{k_{1}}$
⋮
$p_{n} = \frac{b^{- n} - b^{- (n + 1)}}{k_{n}}$
⋮

It can be observed that for no value of b can the inequality p₀>p₁>p₂> . . . hold, for in order to satisfy this inequality, the following should be true:

$\frac{b^{- (j - 1)} - b^{- j}}{k_{j - 1}} > \frac{b^{- j} - b^{- (j + 1)}}{k_{j}}\Leftrightarrow\frac{b^{- (j - 1)} - b^{- j}}{b^{- j} - b^{- (j + 1)}} > \frac{k_{j - 1}}{k_{j}}\Leftrightarrow b > \frac{k_{j - 1}}{k_{j}} = j$

for every j.

Consider a further definition A_(ƒ)(x)=g(P(ƒ(X)≦ƒ(x))) for any continuous function

g: [0, 1]→[0, ∞] that satisfies

(1) g(0)=∞.

(2) g(1)=0.

(3) g is strictly decreasing.

Notice g(0)=∞ is required so that the elements with probability 0 are maximally anomalous, and similarly, g(1)=0 ensures the events with maximal likelihood are least anomalous. Decreasing g ensures that A_(ƒ) respects the distribution given by ƒ. Continuity is required of g to prohibit any jumps in A_(ƒ) that are not caused by the underlying probability distribution. Notice that in many cases (e.g., discrete) the range of ƒ is discrete, so g is trivially continuous as g's domain {y:y=p(ƒ(X)≦ƒ(x)) for some x} is discrete.

Now it can be shown that, given any anomaly score A(x), there exist a function g and a probability distribution ƒ that realize it.

As another example, let A:[−1, e]→[0, ∞] as follows,

${A(x)} = \left\{ \begin{matrix}0 & {x = 0} \\1 & {x \in \left( {0,1} \right\rbrack} \\2 & {x \in \left( {1,{1 + \frac{1}{2}}} \right\rbrack} \\3 & {x \in \left( {{1 + \frac{1}{2}},\left. {1 +} \middle| {\frac{1}{2} + \frac{1}{3!}} \right.} \right\rbrack} \\\vdots & \vdots \\n & \left( {{\sum\limits_{j = 1}^{n - 1}\frac{1}{j!}},{\sum\limits_{j = 1}^{n}\frac{1}{j!}}} \right\rbrack \\\vdots & \vdots\end{matrix} \right.$

A probability distribution p(x) can be constructed and an appropriate function g found so that

A(x)=g(∫_({t:p(t)≦p(x)}) p(t)dt).

As before, since (∫_({t:p(t)≦p(x)})p(t)dt) is increasing in p(x), the implication p(x_(j))>p(x_(k)) implies A(x_(j))<A(x_(k)) must hold, and therefore one must find countably many values p_(j), j=0, 1, 2, . . . , that give the respective value of p(x) on {x:A(x)=j}.

As before, let k_(j)=measure of {x:A(x)=j} so that k₀=1, and for j>0,

$k_{j} = {\frac{1}{j!}.}$

Now encountering our first effective difference, the new (countable) system of equations is

k₀p₀ + k₁p₁ + k₂p₂ + … = g⁻¹(0) k₁p₁ + k₂p₂ + … = g⁻¹(1) ⋮${\sum\limits_{j \geq n}{k_{j}p_{j}}} = {g^{- 1}(n)}$ ⋮

Inductively solving gives

p₀ = 1 − g⁻¹(1)$p_{1} = {{\frac{{g^{- 1}(1)} - {g^{- 1}(2)}}{k_{1}}\vdots p_{n}} = \frac{{g^{- 1}(n)} - {g^{- 1}\left( {n + 1} \right)}}{k_{n}}}$⋮

Notice the requirement that lim_(t→0) g(t)=∞ appears again here, as it is also necessary for p(x) to be a probability distribution. Specifically,

${\sum\limits_{j}{k_{j}p_{j}}} = {{\lim\limits_{N}{\sum\limits_{j = 1}^{N}{k_{j}p_{j}}}} = {1 - {\lim\limits_{N}\; {{g^{- 1}(N)}.}}}}$

One can choose a g satisfying the requirements so that the inequality p₀>p₁>p₂> . . . holds. Or, conversely, any set {p_(j)} can be chosen satisfying this inequality, and g deduced from it.

For example, if we require p_(j)=2^(−(j+1)) (quick check: Σ₀^(∞) 2^(−(j+1))=1, so this is a probability distribution), g can be solved for from the system of equations above, and then checked to make sure g decreases from ∞ to 0. Solving the system inductively, it can be seen that

g⁻¹(0) = 1 ${g^{- 1}(1)} = \frac{1}{2}$${g^{- 1}(2)} = {\frac{1}{2^{2}*{1!}} = \frac{1}{4}}$ ⋮${g^{- 1}(n)} = {1 - \frac{1}{2^{1}*{0!}} - \frac{1}{2^{2}*{1!}} - \ldots - \frac{1}{2^{n}*{\left( {n - 1} \right)!}}}$⋮

FIG. 11 depicts a generalized example of a suitable computing environment 1100 in which the described innovations may be implemented. The computing environment 1100 is not intended to suggest any limitation as to scope of use or functionality, as the innovations may be implemented in diverse general-purpose or special-purpose computing systems. For example, the computing environment 1100 can be any of a variety of computing devices (e.g., desktop computer, laptop computer, server computer, tablet computer, etc.).

With reference to FIG. 11, the computing environment 1100 includes one or more processing units 1110, 1115 and memory 1120, 1125. In FIG. 11, this basic configuration 1130 is included within a dashed line. The processing units 1110, 1115 execute computer-executable instructions. A processing unit can be a general-purpose central processing unit (CPU), a processor in an application-specific integrated circuit (ASIC), or any other type of processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. For example, FIG. 11 shows a central processing unit 1110 as well as a graphics processing unit or co-processing unit 1115. The tangible memory 1120, 1125 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two, accessible by the processing unit(s). The memory 1120, 1125 stores software 1180 implementing one or more innovations described herein, in the form of computer-executable instructions suitable for execution by the processing unit(s).

A computing system may have additional features. For example, the computing environment 1100 includes storage 1140, one or more input devices 1150, one or more output devices 1160, and one or more communication connections 1170. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 1100. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 1100, and coordinates activities of the components of the computing environment 1100.

The tangible storage 1140 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed within the computing environment 1100. The storage 1140 stores instructions for the software 1180 implementing one or more innovations described herein.

The input device(s) 1150 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment 1100. The output device(s) 1160 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 1100.

The communication connection(s) 1170 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.

Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed methods can be used in conjunction with other methods.

Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., one or more optical media discs, volatile memory components (such as DRAM or SRAM), or non-volatile memory components (such as flash memory or hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). The term computer-readable storage media does not include communication connections, such as signals and carrier waves. Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable storage media. The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.

For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C++, Java, Perl, JavaScript, Adobe Flash, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known and need not be set forth in detail in this disclosure.

It should also be well understood that any functionality described herein can be performed, at least in part, by one or more hardware logic components, instead of software. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Program-specific Integrated Circuits (ASICs), Program-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.

The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and subcombinations with one another. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.

In view of the many possible embodiments to which the principles of the disclosed invention may be applied, it should be recognized that the illustrated embodiments are only preferred examples of the invention and should not be taken as limiting the scope of the invention. Rather, the scope of the invention is defined by the following claims. We therefore claim as our invention all that comes within the scope of these claims.

We claim:
 1. A method of detecting anomalous events, comprising: receiving a first log file including a first plurality of events from a first data source; receiving a second log file including a second plurality of events from a second data source that is a different type than the first data source; using the first log file, generating a first anomaly score, the generation being derived from an area associated with a probability density function of the first log file; using the second log file, generating a second anomaly score, the generation being derived from an area associated with a probability density function of the second log file; and comparing the first and second anomaly scores.
 2. The method of claim 1, wherein generating the first anomaly score includes calculating the anomaly score using a function P_(f)(f(X)≦f(x)), wherein f(X) is related to a probability of an occurrence of an event, f(x) is a current event being analyzed, and P_(f) is a probability determination.
 3. The method of claim 1, wherein generating the first anomaly score includes using the formula A_(f)(x)=−log_(b) P_(f)(f(X)≦f(x)) where b>1.
 4. The method of claim 1, wherein a function used to calculate the anomaly score is tunable through user input.
 5. The method of claim 1, wherein generating the first anomaly score includes using the function P_(f)(A_(f)(x)≧α)≦b^(−α) wherein α is a tunable parameter to change a number of false alerts and b is any number >1.
 6. The method of claim 1, wherein the first and second log files include disparate event types.
 7. The method of claim 1, wherein each of the first and second log files are streaming data.
 8. The method of claim 1, wherein comparing the first and second anomaly scores includes generating a linear combination of the anomaly scores using weighted inputs.
 9. A computer-readable storage having instructions thereon for executing a method of detecting anomalous events, the method comprising: receiving a plurality of input network events from disparate network sources; and calculating multiple anomaly scores for each of the plurality of input network events using a function formed at least in part by the expression P_(f)(f(X)≦f(x)), wherein f(X) is related to a probability of an occurrence of an event, f(x) is a current event being analyzed, and P_(f) is a probability determination.
 10. The computer-readable storage of claim 9, wherein generating the anomaly scores includes using the formula A_(f)(x)=−log_(b) P_(f)(f(X)≦f(x)) where b>1.
 11. The computer-readable storage of claim 9, wherein the function used to calculate the anomaly scores is tunable.
 12. The computer-readable storage of claim 9, wherein generating the anomaly scores includes using the function P_(f)(A_(f)(x)≧α)≦b^(−α) wherein α is a tunable parameter to change a number of false alerts and b is any number >1.
 13. The computer-readable storage of claim 9, wherein receiving the plurality of input network events includes receiving multiple log files that include disparate event types.
 14. The computer-readable storage of claim 9, wherein receiving the plurality of input network events includes receiving the input network events as streaming data.
 15. The computer-readable storage of claim 9, further including comparing the anomaly scores, wherein the comparing of anomaly scores includes generating a linear combination of the anomaly scores using weighted inputs.
 16. The computer-readable storage of claim 9, wherein the network sources are selected from a list including one or more of the following: a firewall, a workstation, a DNS server, and a web server.
 17. The computer-readable storage of claim 9, wherein the network events include a timestamp, a source IP address, a destination IP address, a destination port, a protocol and a message code.
 18. A system for detecting anomalous events, comprising: a first anomaly detector for receiving a first log file; a second anomaly detector for receiving a second log file; wherein the first and second anomaly detectors calculate anomaly scores for the respective first and second log files, the anomaly detectors using a function formed at least in part by the expression P_(f)(f(X)≦f(x)), wherein f(X) is related to a probability of an occurrence of an event, f(x) is a current event being analyzed, and P_(f) is a probability determination; and a comparator coupled to the anomaly detectors for comparing the anomaly scores.
 19. The system of claim 18, wherein the first log file and the second log file are sourced from devices having different platforms.
 20. The system of claim 18, wherein the first and second anomaly detectors are tunable to change a number of false alerts.